Take Home Assignment - Origin Financial
Overview
This report aims to show all processes developed to analyze the data challenge proposed by the Origin Financial team. It’s organized into four main sessions:
The first session will be Engineering Raw Data where I’ll describe all the steps needed to clean raw data. The following step, Exploratory Data Analysis has some descriptive analysis about the data. At this point, I’ll guide my analysis to data clustering based on RFM (Recency Frequency and Monetary). After that, I’ll show the clustering analysis at session Clustering and some insights based on it. Finally, I have some notes and future ideas further some conclusions about the challenge.
Engineering Raw Data
Before starting the data analysis, let’s do a quick review of available data sets and raise the need for pre-engineering.
Customers
## Rows: 803
## Columns: 7
## $ id <chr> "131d8363-e2a6-4c60-acef-da03e6bfbdc4", "831216da-744b-4…
## $ created_at <chr> "2020-03-11 19:38:35", "2020-05-01 13:09:00", "2020-03-1…
## $ date_of_birth <chr> "1990-08-16", "1933-11-22", "1986-02-02", "1990-01-01", …
## $ gender <chr> "male", "", "", "male", "", "", "", "male", "male", "mal…
## $ country <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "U…
## $ state <chr> "CA", "WA", "NY", "CA", "", "CA", "", "CA", "NY", "CA", …
## $ city <chr> "San Francisco", "Tonasket", "New York", "San Francisco"…
In a quick overview, it’s possible to note that variables created_at and date_of_birth should be modified to timestamp and date, respectively. Also, we can note that all missing data are represented by empty character space and it will be replaced by NA.
The variables state and city have some strange characters like this: <img src='#' onerror=alert('xss') /> I presume that this behavior is caused by a css operator and it’ll be replaced by NA.
The variable id it’s lowercase, while in the transactions data it’s upper-cased, so it’ll be replaced by uppercase.
Transactions
## Rows: 11,059
## Columns: 17
## $ X_id <chr> "4D2119A1-03D9-48F2-99B7-FEDD…
## $ user_id <chr> "94102846-0B6D-45D8-AFD1-DA80…
## $ account_id <chr> "88F8D694-AFF0-4CFC-BBB6-C89D…
## $ account_name <chr> "Chase - Plaid Money Market (…
## $ description <chr> "ACH Electronic CreditGUSTO P…
## $ type <chr> "expense", "expense", "expens…
## $ amount <dbl> -5850, -5850, -5850, -5850, -…
## $ date <chr> "2021-06-08T00:00:00.000Z", "…
## $ extra_fields.category.0 <chr> "Transfer", "Transfer", "Tran…
## $ extra_fields.category.1 <chr> "Debit", "Debit", "Debit", "D…
## $ extra_fields.category.2 <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.category_id <int> 21006000, 21006000, 21006000,…
## $ extra_fields.merchant_name <chr> "", "", "", "", "", "", "", "…
## $ extra_fields.name <chr> "ACH Electronic CreditGUSTO P…
## $ extra_fields.payment_channel <chr> "other", "other", "other", "o…
## $ extra_fields.payment_meta.payment_method <chr> "ACH", "ACH", "ACH", "ACH", "…
## $ created_at <chr> "2021-06-18T19:14:32.075Z", "…
The transaction data doesn’t have so much engineer to do. It’s just changed timestamp and date types and replace empty spaces with NA. I also, replaced . with _ in some variable names.
Exploratory Data Analysis
The exploratory data analysis it’s divides into univariate analysis before RFM transformation and more insightful after that.
| Name | db_users |
| Number of rows | 803 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| Date | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 36 | 36 | 0 | 803 | 0 |
| created_at | 0 | 1.00 | 2414 | 5633 | 0 | 803 | 0 |
| gender | 751 | 0.06 | 4 | 10 | 0 | 4 | 0 |
| country | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| state | 179 | 0.78 | 2 | 2 | 0 | 27 | 0 |
| city | 176 | 0.78 | 4 | 21 | 0 | 120 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_of_birth | 173 | 0.78 | 1000-01-01 | 2001-01-01 | 1989-12-12 | 119 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_at | 173 | 0.78 | 33.56 | 40.21 | 19.41 | 29.63 | 31.08 | 33.99 | 1020.33 | ▇▁▁▁▁ |
At this simple description of the data set, it’s possible to note that age has an outlier that will be replaced by NA. Also, the great number of missing data in the variable gender, turning it almost useless in the data analysis.
Univariate Analysis
Age
The customer’s age distribution is concentrated between the range 30 - 50 years as shown below:
Maybe this variable could be a good one to use in the clustering process.
State
The geographical distribution of customers is dense in the California State (31%) followed by MD(11.1%) add NC/NY tied with (7.8%). It’s important to note that more than 22% of the customers have no state information.
Users Transactions
Looking at the user transactions, I note a tiny number of different customers, just 84. Some of them (almost 20%) have more than one account, but as we expected to analyze customer behavior over account behavior I’ll treat them as the same customer.
| Number of Accounts | Number of Customers | % of Customers |
|---|---|---|
| 1 | 70 | 83.33 |
| 2 | 7 | 8.33 |
| 3 | 1 | 1.19 |
| 4 | 2 | 2.38 |
| 5 | 2 | 2.38 |
| 6 | 1 | 1.19 |
| 7 | 1 | 1.19 |
At this point it’s really important to say that just 84 customers to do the cluster analysis could be a problem for a bunch of reasons as like:
Some algorithms won’t provide robust results under small samples
The cluster interpretation could be a little messy.
The pattern recognition it’s hard to do
Said that I’ll develop my analysis and won’t concern about it.
Type of transacions
Analyzing the type of transactions we can note that has a mix of three different transactions. Income, Expense, and Transfer, as I’ll show in the next steps, will be treated separately to do the clustering analysis.
The amount of money transactions types have a strange behavior having a low variety of values. Transfer for example has just the value 25. It’s something that could affect the behavior of clustering analysis.
Looking at the number and amount of transactions per day by expenses, transfer and income it’s possible to note that the days 2021-06-15 (income and expense) and 2021-06-16 (income) has a lot more transactions compared to other days. The expense category it’s almost 10x the median for the period and the income it’s more than 10x.
I don’t know if this strange behavior comes from a selection bias or if this is from the data. I’ll presume that this behavior is a normal one.
Transactions Category
Looking at the transactions category we can note that we have a lot of missing ones, but part of that is caused by income transactions.
| extra_fields_category_0 | extra_fields_category_1 | extra_fields_category_2 | extra_fields_category_id | n | perc |
|---|---|---|---|---|---|
| NA | NA | NA | NA | 2742 | 24.79 |
| Travel | Taxi | NA | 22016000 | 1230 | 11.12 |
| Food and Drink | Restaurants | Fast Food | 13005032 | 1057 | 9.56 |
| Travel | Airlines and Aviation Services | NA | 22001000 | 1055 | 9.54 |
| Food and Drink | Restaurants | NA | 13005000 | 1053 | 9.52 |
| Payment | Credit Card | NA | 16001000 | 623 | 5.63 |
| Food and Drink | Restaurants | Coffee Shop | 13005043 | 614 | 5.55 |
| Transfer | Credit | NA | 21005000 | 607 | 5.49 |
| Payment | NA | NA | 16000000 | 443 | 4.01 |
| Recreation | Gyms and Fitness Centers | NA | 17018000 | 443 | 4.01 |
| Shops | Sporting Goods | NA | 19046000 | 443 | 4.01 |
| Transfer | Debit | NA | 21006000 | 425 | 3.84 |
| Transfer | Deposit | NA | 21007000 | 324 | 2.93 |
Almost 20% of transactions came from the travel category, divided into taxi (Uber) and Airline and Aviation.
Food and drink represent 25% divided in restaurants like coffee shops and fast foods.
Also has some transactions represented by recreation and shops and some transfer between accounts
Merchant Name
Looking deeper at the expense types and merchant names, we can see at the top Uber, McDonald’s, Starbucks, and so on.
Channel Transactions
The preferential channel to expenses is in-store (73%). Would be interesting if we had some information about online channels as app payment, website shop, etc.
Payment Methods
By payment methods we have just 4% of transactions with this data, it’s almost useless to the analysis.
RFM Transformation
After this exploratory analysis about variables and looking to a solution to the clustering problem. I decided to transform the transactional data into recency frequency and monetary value for the three types of transactions: Expense, Income, and Transfer.
After this transformation our dataset looks like this:
## Rows: 84
## Columns: 33
## $ id <chr> "AD7226ED-2D26-45FD-AB37-C3823…
## $ state <chr> "NY", "CA", "IL", "NY", NA, "V…
## $ age_at <dbl> 34.10458, 32.52759, 33.34352, …
## $ count_transactions_expense <dbl> 121, 120, 144, 538, 270, 0, 0,…
## $ amount_expense <dbl> 2825.37, 15654.90, 18785.88, 2…
## $ amount_mean_expense <dbl> 23.35017, 130.45750, 130.45750…
## $ count_transactions_income <dbl> 24, 210, 252, 50, 24, 6, 6, 6,…
## $ amount_income <dbl> 12000.00, 585798.60, 702958.32…
## $ amount_mean_income <dbl> 5.000000e+02, 2.789517e+03, 2.…
## $ count_transactions_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ amount_mean_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ n_transactions_food_and_drink <dbl> 72, 0, 0, 246, 122, 0, 0, 0, 0…
## $ n_transactions_payment <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_recreation <dbl> 0, 0, 0, 50, 24, 0, 0, 0, 0, 0…
## $ n_transactions_shops <dbl> 0, 0, 0, 48, 25, 0, 0, 0, 0, 0…
## $ n_transactions_travel <dbl> 49, 0, 0, 146, 74, 0, 0, 0, 0,…
## $ transactions_amount_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_food_and_drink <dbl> 2537.52, 0.00, 0.00, 53286.50,…
## $ transactions_amount_payment <dbl> 0.0, 0.0, 0.0, 99768.0, 51962.…
## $ transactions_amount_recreation <dbl> 0, 0, 0, 3925, 1884, 0, 0, 0, …
## $ transactions_amount_shops <dbl> 0, 0, 0, 24000, 12500, 0, 0, 0…
## $ transactions_amount_travel <dbl> 287.85, 0.00, 0.00, 24573.84, …
## $ transactions_amount_mean_transfer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ transactions_amount_mean_food_and_drink <dbl> 35.24333, 0.00000, 0.00000, 21…
## $ transactions_amount_mean_payment <dbl> 0.0, 0.0, 0.0, 2078.5, 2078.5,…
## $ transactions_amount_mean_recreation <dbl> 0.0, 0.0, 0.0, 78.5, 78.5, 0.0…
## $ transactions_amount_mean_shops <dbl> 0, 0, 0, 500, 500, 0, 0, 0, 0,…
## $ transactions_amount_mean_travel <dbl> 5.87449, 0.00000, 0.00000, 168…
## $ no_channel <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ other <dbl> 0.1655172, 0.0000000, 0.000000…
## $ in_store <dbl> 0.8344828, 0.0000000, 0.000000…
Just 84 data points (customers) and 33 variables. I drop the recency variables because I think that isn’t the goal of the analysis to explore “how much time the customer doesn’t have any transaction”.
Expense vs Income
Firstly, analyzing the behavior of the average amount of transactions in expenses and income we can note some behaviors.
Note that some customer has high average income and low average expense, while we have others that expend more than earn. This is a good spoiler about the clustering.
It’s important to note the outlier that expends much more than earn, I will keep this customer in analysis because it’s a different behavior that could be more representative if we have more data.
Expense Vs Income Vs Age
Adding the variable age to our analysis, it’s possible to note that it doesn’t affect the behavior of income and expense, so this variable wouldn’t be useful in the cluster analysis.
Expense vs Income vs Transfer
Transfer average amount it’s another variable that doesn’t affect the income and expense. As shown below:
Relationship between categorys of expenses
Looking at the correlation between category expenses it’s possible to note that every transaction has a high correlation. This analysis isn’t so conclusive because we have just 13 customers with expenses categories.
Also, I will try to do a cluster with these data points, the main idea is to show my way of thinking about the problem then try to interpret something. Because just 13 customers isn’t enough data for any analysis.
Clustering
At the clustering analysis, my goal is to make two types of clusters, one based on incomes and expenses and another one based on the categories of expense. I think this way could give us some conclusions about the data.
It’s important to say that I use just one method of clustering called kmeans, it’s based on centroid distances. This choice was made because the data sets have few data points and I would like to maintain the analysis as simple as possible.
At the beginning of every clustering, I will analyze to determine the optimal number of clusters, after that the cluster analysis will be made, and also insights about the clusters’ responses.
To determine de optimal number of clusters I run tree methods:
Elbow method - The idea behind this method is to compute the within-cluster variation to different number of clusters and minimize it.
Silhouette Method - It measures the quality of clustering based on the average similarity of each point to its own cluster compared to other clusters.
GAP method - The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).
Type Transactions
Expense vs Income
We can visualize the results of methods to determine the optimal number of clusters. Two of them indicates the best number of cluster should be 6 and the elbow method indicates 4 clusters.
Using 4 clusters we can have some insights about the customers:
1 - Customers with no income and no expense
2 - Customers with medium income and medium expense
3 - Customers with high income and low expense
4 - Customers with no income and high expense
| Type Clustering | Cluster | Number Customers |
|---|---|---|
| cluster_4 | 1 | 1 |
| cluster_4 | 2 | 9 |
| cluster_4 | 3 | 5 |
| cluster_4 | 4 | 69 |
| cluster_6 | 1 | 4 |
| cluster_6 | 2 | 5 |
| cluster_6 | 3 | 1 |
| cluster_6 | 4 | 66 |
| cluster_6 | 5 | 5 |
| cluster_6 | 6 | 3 |
Note that using 6 clusters we have few customers in some clusters and it can cause the suggestion of the methods because the distance between members of cluster has more chance to be tiny.
Expense vs Income vs Freq Expense
Adding the frequency of expenses we can visualize better the clusters.
The methods indicate 5 or 6 clusters, I used 5 in the analysis:
Note the interpretation about the clusters:
1 - Customers with medium income, lot of expense transactions, and medium amount average expense;
2 - Customers with no amount expense;
3 - Customers with medium income, medium transactions expense, and medium amount expense;
4 - Outliers, the customer that has high amount expense in few transactions and has no income;
5 - Customer with high income and low average amount expense with medium transactions.
At this point, we have some good interpretations about users transactions. Let’s see about expenses categories.
Category Expense
We have just 18 customers with expenses, so it’s hard to extract some helpful interpretation or relationship in the data. But let’s look to clustering methods.
At this point, I use the relative amount mean expend to the total amount. The same analogy to the frequency.
Amount Type Expense
The methods indicate just 2 clusters relatives to the amount mean expense in each category.
| Cluster | Transfer | Food and Drink | Payment | Recreation | Shops | Travel |
|---|---|---|---|---|---|---|
| 1 | 0.09 | 0.30 | 0.18 | 0.01 | 0.04 | 0.07 |
| 2 | 0.69 | 0.08 | 0.15 | 0.01 | 0.04 | 0.04 |
The only difference between groups looks that Group 2 expends more on transfers and group one on food and drink. But nothing too conclusive.
Frequency Type Expense
The methods indicate just 2 clusters relatives to the amount mean expense in each category.
| Cluster | Transfer | Food and Drink% | Payment | Recreation | Shops | Travel |
|---|---|---|---|---|---|---|
| 1 | 0.09 | 0.31 | 0.04 | 0.04 | 0.04 | 0.20 |
| 2 | 0.15 | 0.37 | 0.07 | 0.07 | 0.07 | 0.22 |
There aren’t any conclusions about the groups, the expenses are really similar.
Conclusion
Our goal with this analysis was to find some group of customers that has similar behavior in the transaction data. I led the analysis using the approach o frequency and monetary value expenses, incomes, and transfers. Initially, we had 803 customers to cluster after they join between data sets left just 84 customers with the transaction.
Based on it, I achieve some good results in clustering customers using income and expenses. A couple of behaviors were detected. Unfortunately using the expenses data there was no conclusion about the groups, it’s an important stand out that we had just 18 customers, something that makes it hard to reach closure.
Notes
This take-home assignment was very challenging because it’s an open problem and I could have a lot of different ways to do the analysis, My choice was to explore the RFM customer behavior and in the end just FM.
I choose to explore just one method of clustering to make things simple. But in a workday problem, I would test some other methods like hierarchical clustering and principally density-based clustering like Dbscan and Hdbscan, because this method has the advantage of not classifying some data points with the idea of “Not all customers belongs a cluster”.
I would like to thank the Origin team for dedicate time to elaborate on this problem and give me the chance to show my work.